The purpose of the experiment was to see whether Bayesian Search could be used for debugging. Of special interest to me is whether the method is still suitable for low fault probabilities or a large number of potential fault locations.
Underlying this method is Bayesian Search theory.
Git Bisect is a valuable tool for many large projects [Citation Needed]. However, it falls short for non-deterministic bugs. Therefore, a tool like BBChop might be of importance.
Each experiment simulates a test which never fails (if the tested location is before fail_loc in the data) or which fails with probability fail_prob if the tested location is larger or equal to fail_loc. The location that the search finally reports is recorded in where.
The experiments run in a virtualenv with mpmath as a further package installed. The visualisation and analysis needs the pandas and seaborn packages.
analysis.py is called with a list of hostnames as inputs. As a result, the hardcoded configurations are distributed among the hostnames, with each hostname receiving one set of configurations. The result is written to a <hostname>.json file. Calling compute.py with one of these JSON files as input will run the experiment on each configuration in the input file.
Stochastic Root Finding
We can treat the problem as one of Stochastic Root Finding. If we model the test as a function of location returning -1 for a failure and +1 for a success, then we are searching for the root of this function, which is the first faulty location.
V. Roshan Joseph, Yubin Tian and C. F. Jeff Wu, "Adaptive Designs for Stochastic Root-Finding", Statistica Sinica 17 (2007), 1549-1565. http://www3.stat.sinica.edu.tw/statistica/oldpdf/A17n414.pdf
Rolf Waeber, Peter I. Frazier and Shane G. Henderson, "A Bayesian Approach to Stochastic Root Finding", 2011 Winter Simulation Conference, Phoenix, AZ. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.3358&rep=rep1&type=pdf
Evolving code bases sometimes suffer from regressions. A regression can be defined as something that once worked and now does not work anymore.
Searching for regressions in code bases can very well be done with Git Bisect [2]. As a prerequisite, one needs a test that determines whether the fault is present. Also, the latest version should be faulty and some previous version should be non-faulty.
Then, a binary search can be done over the history. This is extremely fast, because it requires only log(N) iterations. However, the prerequisite is that we can reliably test for the presence of the fault.
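A minimal sketch of such a deterministic bisection (my own illustration, not Git's implementation; test(i) is assumed to return True while location i is still good):
def bisect_history(n, test):
    # Find the first faulty index in range(n); test(i) returns True (pass)
    # before the first faulty index and False (fail) from there on.
    lo, hi = 0, n  # invariant: the first faulty index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if test(mid):       # mid still passes -> fault lies after mid
            lo = mid + 1
        else:               # mid fails -> fault is at mid or earlier
            hi = mid
    return lo               # first faulty index (n if nothing fails)
This needs only about log2(n) test runs, which is the log(N) behaviour mentioned above.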
However, not all faults are deterministic. Some faults are 'intermittent' or 'stochastic'. In this analysis, I evaluate how well the location of a fault can be found given such stochastic behaviour.
In 2009, "Ealdwulf Wuffinga" announced [1] on the git mailing list that he had implemented a more general algorithm for finding bugs. He called it BBChop, for Bayesian Binary Chop. It is based on Bayesian Search Theory.
BBChop is like git bisect (or equivalent), but works when your bug is intermittent. That is, it works in the presence of false negatives (when a version happens to work this time even though it contains the bug). [3]
I ran BBChop with varying parameters and recorded how many trials were necessary to identify the faults.
In [1]:
%matplotlib inline
import json
import pandas as pd
import seaborn as sn
In [2]:
with open('results.json') as f: d = json.load(f)
In [3]:
df = pd.DataFrame.from_dict(d)
df.head()
Out[3]:
You can see that there are the following columns:
N: Size of the range to search. For example, the number of commits in a code base. I used 10, 100, and 1000.
certainty: How certain to be before terminating the search. I used two values here: 0.9 and 0.7 (90% and 70%, respectively).
fail_loc: The location where the first failure occurred. I used 0.7 * N and 0.5 * N.
fail_prob: The rate of failure if the fault is present. After fail_loc the test fails with probability fail_prob; before fail_loc it never fails.
seed: The pseudo-random number generator's seed. I used this so that results would be reproducible.
success: Whether the found location (where) equals fail_loc.
trials: How many trials were needed to come to a solution.
where: Where the fault was thought to be. If that location is the true faulty location, then success == True.
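Putting fail_loc and fail_prob together, the simulated test behaves roughly like the following sketch (my own reconstruction of the setup described above, not the experiment's actual code):
import random

def simulated_test(location, fail_loc, fail_prob, rng=random):
    # Locations before fail_loc always pass; locations at or after
    # fail_loc fail intermittently with probability fail_prob.
    if location < fail_loc:
        return True
    return rng.random() >= fail_prob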
In [22]:
sn.lmplot(x='fail_prob', y='trials', data=df.query('N in [10, 100, 1000]'), hue='N', fit_reg=False)
Out[22]:
In [5]:
sn.distplot(df['seed'])
Out[5]:
The distribution of seeds is not very helpful; they basically lie between 1 and 9.
Trials vs. probability is not very informative either, not even when I separate by N, the size of my history.
But I suspect that there is a bug for small probabilities. Let's filter those out.
In [23]:
df_large_prob = df[df.fail_prob > 0.1]
sn.lmplot(x='fail_prob', y='trials', data=df_large_prob.query('N in [10, 100, 1000]'), hue='N')
Out[23]:
That is more like it.
Next, let's do a violin plot to see the distribution at each probability.
In [21]:
sn.violinplot(x='fail_prob', y='trials', hue='N',
data=df_large_prob.query('N in [10, 100, 1000]'))
Out[21]:
Can you see anything? I can't.
But I think I'd like to see the same for only N == 1000, and split the violin plot on success. Because, honestly, I don't care much about the success == False cases.
In [8]:
df_N1000 = df.query('N == 1000')
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000, split=True)
Out[8]:
Very nice! First observation: probabilities below 0.1 are different. Just to be safe, let's do a log plot. Second observation: at low probabilities, success or failure makes all the difference to the number of trials.
In [9]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 7))
ax.set(yscale="log")
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000, split=True, ax=ax)
Out[9]:
That certainly looks better, but for now I want to limit my observations to p in [0.1, 1.0] and stay with a linear axis.
Next, let's repeat the step from above and show trials vs. probabilities for N = 1000 and p >= 0.1.
In [10]:
df_N1000_large_p = df.query('N == 1000').query('fail_prob >= 0.1')
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000_large_p, split=True)
Out[10]:
My colleague saw this the other day, and asked whether the distributions are bimodal. So, how many data points are there for each violin-plot?
In [11]:
df.query('N == 1000').groupby('fail_prob').size()
Out[11]:
192 values for 0.250. That seems enough.
Next, I want to produce another plot. Trials vs Commits to search, parameterised over p.
In [12]:
sn.lmplot(x='N', y='trials', data=df, hue='fail_prob')
Out[12]:
Not useful at all. What I would really like is a log(N) regression. Why log(N)? Because that is what I get with binary search. Let's thin the plot down a bit:
In [13]:
sn.lmplot(x='N', y='trials', data=df.query('fail_prob in [0.1, 0.5, 1.0]'), hue='fail_prob')
Out[13]:
In [14]:
# Better, but let us try with only 'success'
locs = [int(0.5 * n) for n in set(df['N'])]
sn.lmplot(x='N', y='trials', data=df.query('fail_prob in [0.1, 0.5, 1.0] \
and fail_loc in {} \
and certainty == 0.9 \
and success == True'.format(locs)),
hue='fail_prob', fit_reg=False)
Out[14]:
In [15]:
# df2: A view of [N, trials]
df2 = df.query('fail_prob == 1.0 and success == True and certainty == 0.7')[['N', 'trials']]
df2.insert(len(df2.columns), 'algorithm', 'BBChop')
# Git Bisect performance.
x = list(range(10, 1000, 100))
from math import ceil, log2
y = list(map(lambda n: ceil(log2(n)), x))
df3 = pd.DataFrame({'N': x, 'trials': y, 'algorithm': 'Git Bisect'})
sn.lmplot(x='N', y='trials', data=pd.concat([df2, df3]), hue='algorithm', fit_reg=False)
Out[15]:
In [16]:
# Is the certainty accurate?
for c in [0.7, 0.9]:
    counts = df.query('certainty == {}'.format(c)).groupby('success').size()
    print('Success rate of algorithm: {}; expected {}.'.format(counts[True] / (counts[True] + counts[False]), c))
When running a test provided with BBChop, I noticed that for a large number of detections, Beta(d+1, t+1) * P(L) evaluates to 0, for d, t being large (~ 640). This might be related to the poor performance at p < 0.1. [It is not, as my analysis later on shows.]
Developers with practical experience in probabilistic algorithms know that, for numerical stability, it is often advisable to work with the logarithms of probabilities rather than the raw probabilities. The developer of BBChop knew this too, which is obvious from his use of logarithms inside the math functions and his use of the Fraction module to avoid floats.
However, Beta is implemented as
def Beta(a, b):
    return exp(log(Gamma(a)) + log(Gamma(b)) - log(Gamma(a+b)))
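Keeping Beta itself in log space would be a small step, sketched here with math.lgamma rather than BBChop's own Gamma helper; the harder part, as discussed next, is staying in log space through the later summations.
from math import lgamma

def log_beta(a, b):
    # log of the Beta function, computed entirely in log space, so it
    # cannot underflow the way exp(...) does for large a and b.
    return lgamma(a) + lgamma(b) - lgamma(a + b)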
It was not obvious to me how to make use of logarithms all the way through the computations. In retrospect, only a small change was required to fix the observed behaviour.
There is the line for li in likelihoods: probs.append(li/likelihoodTot), with likelihoodTot being the sum over some Beta(., .) * P(.) terms.
One possible solution would be to use the identity log(a + b) = log(a) + log(1 + b/a).
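In log space this is the usual log-add trick; a minimal sketch (assuming log_a >= log_b):
from math import exp, log1p

def log_add(log_a, log_b):
    # log(a + b) from log(a) and log(b) without ever forming a or b,
    # using log(a + b) = log(a) + log(1 + b/a); assumes log_a >= log_b.
    return log_a + log1p(exp(log_b - log_a))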
http://stats.stackexchange.com/a/66621 suggests to "Subtract the maximum logarithm from all logs."
This was indeed the solution.
$\DeclareMathOperator*{\argmax}{argmax}$
\begin{align} P_i &= \frac{L_i}{\sum_j L_j} \end{align}
Since the equation is invariant under scaling all $L_i$ by a common factor $k$, we can work with scaled likelihoods $L_i = \tilde{L}_i k$, where the $\tilde{L}_i$ are the original likelihoods. Thus $\log L_i = \log(\tilde{L}_i k) = \log \tilde{L}_i + \log k$. Now choose $\log k$ such that $\exp(\log L_i)$ does not underflow. This can be achieved by making the largest value equal to 1: for $i = \argmax_i \log \tilde{L}_i$ we have $\exp(\log L_i) = 1 \Leftrightarrow \log L_i = 0 \Leftrightarrow \log \tilde{L}_i = -\log k$.
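A sketch of that normalisation in plain Python (my own illustration of the subtract-the-maximum trick, not BBChop's actual code):
from math import exp

def normalise_log_likelihoods(log_likelihoods):
    # Choose log k = -max(log L~_i), so the largest scaled term is exp(0) = 1
    # and exp() never underflows for the terms that matter.
    log_k = -max(log_likelihoods)
    scaled = [exp(ll + log_k) for ll in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]  # the probabilities P_i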
I implemented it and the code that erred now passes.
In a later section I show that this change did not alter the results. However, my test data does not contain such large numbers of observations. Remember, the bug happens when Beta(d+1, t+1) * P(L) underflows to 0; this occurs for Beta(638 + 1, 664 + 1), where 638 and 664 are the numbers of faulty and non-faulty (or absolute?) observations at a given location.
As mentioned above, not using log-likelihoods led, at a high number of observations, to an underflow in the exponential.
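A quick check, sketched here with math.lgamma rather than BBChop's internals, shows the underflow at that size:
from math import exp, lgamma

log_b = lgamma(639) + lgamma(665) - lgamma(639 + 665)  # log Beta(638+1, 664+1)
print(log_b)        # roughly -906, far below the ~-745 limit of double precision
print(exp(log_b))   # underflows to 0.0, which is why the product evaluated to 0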
Does this explain the bad performance at low probabilities?
In [17]:
with open('results_non_logarithm.json', 'r') as f: d = json.load(f)
df_non_log = pd.DataFrame.from_dict(d)
df_non_log.head()
Out[17]:
In [18]:
# Maybe a good comparison between Log vs Linear is for each config plot trials[log] vs trials[lin]?
# Drop duplicates
df_non_log.drop_duplicates(inplace=True)
merge = pd.merge(df, df_non_log, on = ['N', 'certainty', 'fail_loc', 'fail_prob', 'seed', 'where'], how='inner')
trials_log = merge['trials_x']
trials_lin = merge['trials_y']
sn.lmplot(x='trials_x', y='trials_y', data = merge, fit_reg=False)
Out[18]:
In [19]:
non_equals = merge.query('trials_x != trials_y')
assert(len(non_equals) == 0)
Algorithm:
Run binary search on the history:
1. Pick the midpoint of the history.
2. Run the test.
3. While the test passes and the number of test runs at the current location is less than $k$, continue running the test at the current location.
4. Mark the current midpoint as either failed or passed according to the result from step 3.
5. Compute the next midpoint and continue from step 1, if there is still a midpoint. (A sketch follows below.)
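A minimal sketch of this naive approach (my own illustration; test and k as above, with test(i) possibly passing by chance at a faulty location):
def naive_chop(n, test, k):
    # Binary search over range(n) for the first faulty location, retesting
    # each midpoint up to k times; a single failure marks it as faulty.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        failed = any(not test(mid) for _ in range(k))  # steps 2-4
        if failed:
            hi = mid          # fault is at mid or earlier
        else:
            lo = mid + 1      # (probably) no fault up to and including mid
    return lo                 # estimated first faulty location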
This algorithm would have the following characteristics.
The certainty of not misclassifying a faulty location is $c_i = 1 - (1 - p)^k$; a non-faulty location is never misclassified, since there are no false positives. In the worst case the certainty of not misclassifying any location is $c = \prod_i^n c_i = \prod_i^n \left(1 - (1-p)^k\right) = \left(1 - (1-p)^k\right)^n$, where $p$ is the fault probability and $n$ is the number of locations to evaluate, i.e. $n = \left\lceil \log_2 N \right\rceil$. We can re-arrange this to find the number of tests $k$ we have to run at each location: $k = \left\lceil \frac{\log (1 - \exp(\log c / n) )}{\log (1-p)} \right\rceil$ (valid for $p \lt 1$).
The maximum number of trials to undertake becomes:
\begin{align} k n &= \left\lceil \frac{\log (1 - \exp(\log c / n) )}{\log (1-p)} \right\rceil n \\ \end{align}
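As a quick sanity check of the formula (my own example): with $c = 0.9$, $p = 0.5$ and $N = 1000$ we get $n = \lceil \log_2 1000 \rceil = 10$, $k = \left\lceil \log(1 - \exp(\log 0.9 / 10)) / \log 0.5 \right\rceil = 7$, and thus at most $kn = 70$ trials.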
In [20]:
p = 0.5
certainty = 0.9
condition = 'fail_prob == {p} and success == True and certainty == {cert}'.format(p=p, cert=certainty)
df2 = df.query(condition)[['N', 'trials']]
df2.insert(len(df2.columns), 'algorithm', 'BBChop')
# Naive Chop performance.
x = list(range(10, 1000, 100))
from math import ceil, log2, log, exp
f_n = lambda n: ceil(log2(n))
numerator = lambda n: log(1 - exp(log(certainty) / f_n(n)))  # n in the formula is the number of locations, ceil(log2(N))
f = lambda n: ceil(numerator(n) / log(1-p)) * f_n(n)
# Average case
# f = lambda n: ceil(log2(n)) * (ceil(0.5 * 1. / p) + ceil(0.5 * log(1-certainty)/log(1-p)))
y = list(map(f, x))
df3 = pd.DataFrame({'N': x, 'trials': y, 'algorithm': 'Naive Chop'})
sn.lmplot(x='N', y='trials', data=pd.concat([df2, df3]), hue='algorithm', fit_reg=False)
Out[20]:
From this it seems safe to conclude that BBChop has better worst-case complexity than Naive Chop. However, this is less clear for the expected number of trials needed.